Fix (all?) parallel hyperopt issues #2415
Merged
scarlehoff merged 7 commits into master, Jan 29, 2026
Conversation
scarlehoff (Member, Author) commented:
This is now ready. The only point I haven't really been able to fix is the creation of the cache hashes in the working directory. One solution is to run each worker in a separate node, with only the one holding the database running locally, but that is a cluster-dependent solution. Btw, in principle you can run in parallel in different folders, nodes or computers, and the […]
Member
Radonirinaunimi
left a comment
Sorry for the delay. Here are some comments.
n3fit run: the server is now started only if a connection to a database fails; `--restart` is removed, every run now restarts
Co-authored-by: Tanjona R. Rabemananjara <rrabeman@nikhef.nl>
Radonirinaunimi
approved these changes
Jan 29, 2026
With this PR we can now submit to multiple nodes in parallel (technically even to multiple clusters if you are feeling lucky).
The memory footprint of mongodb itself is much smaller as well.
Now an instance of `n3fit` can only spawn one single mongo worker, so it doesn't matter whether they are running in the same node or not: every worker connects to the database in the same manner. The "number of mongo workers" option has been removed.

Now when a parallel hyperopt is about to run, each worker checks whether there is a database already running; if there is none, it starts the database and writes down its address. All other instances of `n3fit` will find the address and try to connect.

There is also no longer a `--restart` option. Every new run is always a restart. If you don't want to continue a previous run, just change the name of the runcard; overwriting previous runs is very impolite.

Also, the database is no longer written to a separate folder and then compressed only when finishing cleanly. Now the database always lives in the `nnfit` folder. It will be compressed at `vp-upload` time, but not before (this matters because it is so big), and only if the `--upload-db` option is used; otherwise the database is skipped during the upload. This makes hyperopt runs much less heavy on the nnpdf server.

To do:
- Check that `mongodb` works well, in particular automatically finding in which node the database is running. Seems to work, but maybe a lockfile is needed to ensure two databases don't start at once.
- Remove the hash files in the working directory. Not sure whether this is possible, but it is quite ugly. Can't find a way to do this 🤷‍♂️
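The discovery protocol described above (each worker checks for a running database, and if there is none, starts one and advertises its address), combined with the lockfile idea from the to-do list, could be sketched roughly as below. This is only an illustration, not the actual `n3fit` implementation: the file names (`mongo_address.txt`, `mongo.lock`), the function names, and the error handling are all hypothetical.

```python
import os
import socket
from pathlib import Path


def server_answers(address, timeout=1.0):
    """Return True if something accepts TCP connections at "host:port"."""
    host, port = address.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


def find_or_start_database(workdir, start_server):
    """Connect to a database advertised in ``workdir``, or start one.

    ``start_server`` must launch the database and return its "host:port".
    Only one worker can win the lockfile race; the others should either
    connect to the advertised address or back off and retry.
    """
    workdir = Path(workdir)
    address_file = workdir / "mongo_address.txt"  # hypothetical filename
    lock_file = workdir / "mongo.lock"            # hypothetical filename

    # If some worker already advertised an address and it answers, use it
    if address_file.exists():
        address = address_file.read_text().strip()
        if server_answers(address):
            return address

    # O_CREAT | O_EXCL is atomic: exactly one worker can create the lock,
    # which prevents two databases from being started at once
    try:
        fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
    except FileExistsError:
        raise RuntimeError("another worker is already starting the database")

    address = start_server()
    address_file.write_text(address)  # advertise for the other workers
    return address
```

In a real multi-node setting the working directory would have to live on a shared filesystem for the address file and lockfile to be visible to all workers, which is presumably why the solution is described as cluster-dependent.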